Overview

In this assessment we aim to use the MACCDC conn data to perform data analysis and modelling. We first must import the data.

mydata <- read.csv("MAC.csv")
mydata <- data.frame(mydata)
mydata

We first want to look for missing data. Service, duration, orig_bytes, resp_bytes and local_orig all seem to have missing data in them so we will see what percentage.

mtab0=data.frame(
    missingduration=is.na(mydata[,"duration"]),
    proto=mydata[,"proto"])
mtab0=table(mtab0)
(apply(mtab0,2,function(x)x/sum(x)))
               proto
missingduration      icmp       tcp       udp
          FALSE 0.8585338 0.1656118 0.3089144
          TRUE  0.1414662 0.8343882 0.6910856
mtab1=data.frame(
    missing_orig_bytes=is.na(mydata[,"orig_bytes"]),
    proto=mydata[,"proto"])
mtab1=table(mtab1)
(apply(mtab1,2,function(x)x/sum(x)))
                  proto
missing_orig_bytes      icmp       tcp       udp
             FALSE 0.8585338 0.1656118 0.3089144
             TRUE  0.1414662 0.8343882 0.6910856
mtab2=data.frame(
    missing_resp_bytes=is.na(mydata[,"resp_bytes"]),
    proto=mydata[,"proto"])
mtab2=table(mtab2)
(apply(mtab2,2,function(x)x/sum(x)))
                  proto
missing_resp_bytes      icmp       tcp       udp
             FALSE 0.8585338 0.1656118 0.3089144
             TRUE  0.1414662 0.8343882 0.6910856
mtab3=data.frame(
    missing_local_orig=is.na(mydata[,"local_orig"]),
    proto=mydata[,"proto"])
mtab3=table(mtab3)
(apply(mtab3,2,function(x)x/sum(x)))
icmp  tcp  udp 
   1    1    1 

Thus we are missing the local_orig feature for every data point in the data set. We may then consider dropping this entire column as it serves no use to us and we cannot impute the data without prior knowledge of the data set and what it should look like. The duration, orig_bytes and resp_bytes all appear to be missing exactly the same data - on further analysis, we see that whenever one is missing, all three are missing.

Some initial data cleansing will come from removing the X column and the ts column. The X column is produced by the sampling and since we have a random sample of the data, the ts provides no real information on the data.

unique_uid <- mydata[!duplicated(mydata[,c('uid')]),]
unique_uid

Thus all our uid’s are unique and therefore wont provide us with any extra information either since they will be uncorrelated with the rest of the data. This is the only column with this trait, and all other columns have values which occur more than once so we can drop the uid column too.

drop_columns <- c("X","ts","local_orig","uid")
mydata <- mydata[, !names(mydata) %in% drop_columns]
head(mydata)

So we have removed the columns that didn’t provide us with any extra information. We will now extract the data we will use for DBSCAN to create clusters. The following code is pulled from Alex’s workbook and allows us to pull out 7 of the features to use for DBSCAN and ensures all elements are numeric.

miss.me <- vector(length = nrow(mydata))
miss.me <- rep(0, times = nrow(mydata))
for(i in 1:nrow(mydata)) {
    if(is.na(mydata$duration[i])) { miss.me[i] <- 1 }
    }
str(mydata)
'data.frame':   226943 obs. of  16 variables:
 $ id.orig_h    : chr  "192.168.202.110" "192.168.202.83" "192.168.202.110" "192.168.202.138" ...
 $ id.orig_p    : int  50427 46442 12662 35203 63958 39436 52132 41741 40224 63022 ...
 $ id.resp_h    : chr  "192.168.22.102" "192.168.206.44" "192.168.22.1" "192.168.24.187" ...
 $ id.resp_p    : int  15457 42 17800 10629 56587 33384 32776 7476 1688 2010 ...
 $ proto        : chr  "tcp" "tcp" "tcp" "tcp" ...
 $ service      : chr  "-" "-" "-" "-" ...
 $ duration     : num  NA NA NA NA NA NA NA NA NA NA ...
 $ orig_bytes   : int  NA NA NA NA NA NA NA NA NA NA ...
 $ resp_bytes   : int  NA NA NA NA NA NA NA NA NA NA ...
 $ conn_state   : chr  "REJ" "REJ" "S0" "S0" ...
 $ missed_bytes : int  0 0 0 0 0 0 0 0 0 0 ...
 $ history      : chr  "Sr" "Sr" "S" "S" ...
 $ orig_pkts    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ orig_ip_bytes: int  48 60 48 44 48 44 60 60 60 44 ...
 $ resp_pkts    : int  1 1 0 0 0 1 1 0 1 0 ...
 $ resp_ip_bytes: int  40 40 0 0 0 40 40 0 40 0 ...
mydata.good <- as.data.frame(cbind(id.orig_p = mydata$id.orig_p, id.resp_p = mydata$id.resp_p, 
orig_pkts = mydata$orig_pkts, orig_ip_bytes = mydata$orig_ip_bytes, 
resp_pkts = mydata$resp_pkts, resp_ip_bytes = mydata$resp_ip_bytes))
mydata.good<- cbind(mydata.good, miss.me)
head(mydata.good)
str(mydata.good) # Should be only ints and nums
'data.frame':   226943 obs. of  7 variables:
 $ id.orig_p    : int  50427 46442 12662 35203 63958 39436 52132 41741 40224 63022 ...
 $ id.resp_p    : int  15457 42 17800 10629 56587 33384 32776 7476 1688 2010 ...
 $ orig_pkts    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ orig_ip_bytes: int  48 60 48 44 48 44 60 60 60 44 ...
 $ resp_pkts    : int  1 1 0 0 0 1 1 0 1 0 ...
 $ resp_ip_bytes: int  40 40 0 0 0 40 40 0 40 0 ...
 $ miss.me      : num  1 1 1 1 1 1 1 1 1 1 ...
for(i in 1:ncol(mydata.good)) { mydata.good[,i] <- as.numeric(mydata.good[,i]) }
str(mydata.good)        ## All should be nums now
'data.frame':   226943 obs. of  7 variables:
 $ id.orig_p    : num  50427 46442 12662 35203 63958 ...
 $ id.resp_p    : num  15457 42 17800 10629 56587 ...
 $ orig_pkts    : num  1 1 1 1 1 1 1 1 1 1 ...
 $ orig_ip_bytes: num  48 60 48 44 48 44 60 60 60 44 ...
 $ resp_pkts    : num  1 1 0 0 0 1 1 0 1 0 ...
 $ resp_ip_bytes: num  40 40 0 0 0 40 40 0 40 0 ...
 $ miss.me      : num  1 1 1 1 1 1 1 1 1 1 ...
# sum(mydata.good$miss.me)/nrow(mydata.good) ## 82.7% missing

I dont want to drop any data that may be important so I’ll also use the protocol, connection state and history features in my analysis.

proto <- as.factor(c(mydata$proto))
proto <- unclass(proto)

conn_state <- as.factor(c(mydata$conn_state))
conn_state <- unclass(conn_state)

history <- as.factor(c(mydata$history))
history <- unclass(history)

mydata.good$proto <- proto
mydata.good$conn_state <- conn_state
mydata.good$history <- history

mydata.good
    ## We'll do 10-fold CV and then apply DBSCAN, training on 90%
dg <- mydata.good
ran <- sample(1:nrow(dg), 0.9 * nrow(dg))
nor <-function(x) { (x -min(x))/(max(x)-min(x))   }
dg_norm <- as.data.frame(lapply(dg, nor))
    # head(dg_norm)

dg_train <- dg_norm[ran,]   ## extract training set
dg_test <- dg_norm[-ran,]       ## extract testing set
dg_target_cat <- dg[ran, ncol(dg)]
dg_test_cat <- dg[-ran, ncol(dg)]

SVD

Now we can look at running DBSCAN on our data. We first need to perform PCA to figure out how many principle components to use in DBSCAN.

library("dbscan")
dg_train.svd <- svd(dg_train)
plot(dg_train.svd$d,xlab="Eigenvalue index",ylab="Eigenvalue",log="y")

plot(dg_train.svd$d,xlab="Eigenvalue index",ylab="Eigenvalue")

Plotting with a log y-axis and a normal y-axis give strikingly different results. The first eigenvector explains most of the variance and the 5 after that seeming explain almost the same amount of variance between them as the first one. I don’t think using just the first eigenvalue would provide that much insight and therefore will use 6 of them (this is still reducing dimensionality by almost half).

npcs = 6

We now plot the PCA to visualise the clusters formed here. We’re not plotting according to any categorical data i.e. normal vs non-normal so we may not get that much information from this.

i=1;j=2
plot(dg_train.svd$u[,i],
     dg_train.svd$u[,j],type="p",
     col="#33333311",pch=16,cex=1)

Finding Parameters for DBSCAN

Eps specifies how close the points should be to each other to form a cluster. If the distance is less than eps, they are considered neighbours. We find this number by finding the ‘knee’ in the plot below. I have chosen to use 7 neighbours here.

test=kNNdist(dg_train.svd$u[,1:npcs], k = 7,all=TRUE)
testmin=apply(test,1,min)
plot(sort(testmin[testmin>1e-8]),log="y")
threshholds= c(0.01,0.001,0.0001,0.00001,0.000001)
abline(h=c(0.01,0.001,0.0001,0.00001,0.000001))
abline(h=0.0001, col="red")

So we choose h=0.0001 as our limit since this allows us to capture most of the information here. We also need to define our minimum number of points to form a cluster. The recommendation is to use minPts = 2*dim for large data sets to ensure we find significant clusters and avoid noise so this is what we shall choose.

##DBSCAN

Now we finally perform DBSCAN.

minPts = c(20, 25, 30, 35, 40, 45, 50, 75, 100, 125, 150, 175, 200, 225, 250)
clustercounts = c()

for(val in minPts) {
  dbscanres = dbscan(dg_train.svd$u[,1:6],eps = 0.0001,minPts = val)
  clustercounts[val] <- (length(unique(dbscanres$cluster)))
}
clustercounts
  [1]  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA 209  NA
 [22]  NA  NA  NA 143  NA  NA  NA  NA 157  NA  NA  NA  NA 136  NA  NA  NA  NA 110  NA  NA
 [43]  NA  NA  99  NA  NA  NA  NA 111  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
 [64]  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  98  NA  NA  NA  NA  NA  NA  NA  NA  NA
 [85]  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  85  NA  NA  NA  NA  NA
[106]  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  40  NA
[127]  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
[148]  NA  NA  25  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
[169]  NA  NA  NA  NA  NA  NA  17  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
[190]  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  22  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
[211]  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  21  NA  NA  NA  NA  NA  NA
[232]  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  24

Thus we seem to hit some limiting value at 175 since a minPts of 200 has a higher amount of clusters than a minPts of 175. Since we want ou data to be interpretable we may look at reducing the amount of clusters and therefore the noise. Thus we will use an initial minPts of 175.

The output we get from this isn’t very insightful though and leaves a lot to be desired.

dbscan175 = dbscan(dg_train.svd$u[,1:6],eps = 0.0001,minPts = 175)
dbscan50 = dbscan(dg_train.svd$u[,1:6],eps=0.0001,minPts = 50)
dbscan30 = dbscan(dg_train.svd$u[,1:6],eps=0.0001,minPts = 30)
library(cluster)
# trying to calculate the silhouette score of this clustering to see if its valid or not - currently reports Error: Vector memory exhausted (limit reached?) - I've tried looking into work arounds but cant get anything working so I'll leave this for now.
#ss <- silhouette(dbscanres$cluster, dist(dg_train.svd$u))

Plotting resulting clusters

for (k in 1:5){
    a = seq(k+1,6)
    for (l in a){
        if(k==l){next}
        plot(dg_train.svd$u[,k],
            dg_train.svd$u[,l],xlab="",
            ylab="",
            col=c("#66666666",rainbow(41))[dbscan175$cluster+1],pch=19,cex=0.5)
    par(new=TRUE)
    }
}

for (k in 1:5){
    a = seq(k+1,6)
    for (l in a){
        if(k==l){next}
        plot(dg_train.svd$u[,k],
            dg_train.svd$u[,l],xlab="",
            ylab="",
            col=c("#66666666",rainbow(41))[dbscan50$cluster+1],pch=19,cex=0.5)
    par(new=TRUE)
    }
}

for (k in 1:5){
    a = seq(k+1,6)
    for (l in a){
        if(k==l){next}
        plot(dg_train.svd$u[,k],
            dg_train.svd$u[,l],xlab="",
            ylab="",
            col=c("#66666666",rainbow(41))[dbscan30$cluster+1],pch=19,cex=0.5)
    par(new=TRUE)
    }
}

for (k in 1:5){
    a = seq(k+1,6)
    for (l in a){
        if(k==l){next}
        plot(dg_train.svd$u[,k],
            dg_train.svd$u[,l],xlab="",
            ylab="",
            col=c("#66666666",rainbow(41))[dbscan175$cluster+1],pch=19,cex=0.5)
    }
}

for (k in 1:5){
    a = seq(k+1,6)
    for (l in a){
        if(k==l){next}
        plot(dg_train.svd$u[,k],
            dg_train.svd$u[,l],xlab="",
            ylab="",
            col=c("#66666666",rainbow(41))[dbscan50$cluster+1],pch=19,cex=0.5)
    }
}

for (k in 1:5){
    a = seq(k+1,6)
    for (l in a){
        if(k==l){next}
        plot(dg_train.svd$u[,k],
            dg_train.svd$u[,l],xlab="",
            ylab="",
            col=c("#66666666",rainbow(41))[dbscan30$cluster+1],pch=19,cex=0.5)
    }
}

jpeg("Assessment2 DBSCAN Clustering.jpg")

for (k in 1:5){
    a = seq(k+1,6)
    for (l in a){
        if(k==l){next}
        plot(dg_train.svd$u[,k],
            dg_train.svd$u[,l],xlab="",
            ylab="",
            col=c("#66666666",rainbow(41))[dbscan30$cluster+1],pch=19,cex=0.5)
    par(new=TRUE)
    }
}
jpeg("Assessment2 DBSCAN Clustering Eigenvector split.jpg")
op <- par(mfrow=c(5,3))
for (k in 1:5){
  a = seq(k+1,6)
  for (l in a){
      if(k==l){next}
      plot(dg_train.svd$u[,k],
            dg_train.svd$u[,l],xlab=k,
            ylab=l,
            col=c("#66666666",rainbow(41))[dbscan30$cluster+1],pch=19,cex=0.5)
    }
}
par(op)
dev.off()
null device 
          1 
dg_train.clustered <- data.frame(dg_train)

dg_train.clustered$cluster <- dbscan175$cluster

dg_train.clustered

M matrix

Now we look at the M matrix produced by Alex that is a sparse matrix showing connections between origin IP’s and response IP’s.

library(Matrix)
library(irlba)
library(Rtsne)
So1 <- tapply(mydata$id.orig_h, mydata$id.orig_h)
De1 <- tapply(mydata$id.resp_h, mydata$id.resp_h)
Est <- as.matrix(cbind(So1, De1))
M<- sparseMatrix(i=Est[,1], j=Est[,2])
M.svd = svd(M)
plot(M.svd$d,xlab="Eigenvalue index",ylab="Eigenvalue",log="y")

plot(M.svd$d,xlab="Eigenvalue index",ylab="Eigenvalue")

From the log axis it looks like we need the first ~100 eigenvalues but using the normal plot, it looks like we an get away with using ~30.

npcsM = 30
testM=kNNdist(M.svd$u[,1:npcsM], k = 7,all=TRUE)
testminM=apply(testM,1,min)
plot(sort(testminM[testminM>1e-15]),log="y")
threshholds= c(0.1,0.01,0.001,0.0001,0.00001,0.000001)
abline(h=c(0.01,0.001,0.0001,0.00001,0.000001))
abline(h=0.5, col="red")

It looks like we want eps = 0.5 although we dont seem to get a knee in the data so it’s hard to pinpoint this.

MminPts = c(20, 50, 200)
clustercountsM = c()

for(val in MminPts) {
  dbscanresM = dbscan(dg_train.svd$u[,1:6],eps = 0.5,minPts = val)
  clustercountsM[val] <- (length(unique(dbscanresM$cluster)))
}

References:

  1. Data from SecRepo

  2. Converting categorical variables

  3. Adding columns to data frames

  4. Finding Unique Values

  5. DBSCAN on flowers

  6. Saving Plots

  7. DBSCAN Parameter Estimation

  8. Finding the knee in kNNDist

  9. Silhouette Score introduction

  10. Error with silhouette score

  11. Silhouette Function

---
title: "Assessment 2 - Matt"
output:
  html_notebook: default
---
# Overview
In this assessment we aim to use the MACCDC conn data to perform data analysis and modelling. We first must import the data.

```{r}
mydata <- read.csv("MAC.csv")
mydata <- data.frame(mydata)
```

```{r}
mydata
```
We first want to look for missing data. Service, duration, orig_bytes, resp_bytes and local_orig all seem to have missing data in them so we will see what percentage.

```{r}
mtab0=data.frame(
    missingduration=is.na(mydata[,"duration"]),
    proto=mydata[,"proto"])
mtab0=table(mtab0)
(apply(mtab0,2,function(x)x/sum(x)))

mtab1=data.frame(
    missing_orig_bytes=is.na(mydata[,"orig_bytes"]),
    proto=mydata[,"proto"])
mtab1=table(mtab1)
(apply(mtab1,2,function(x)x/sum(x)))

mtab2=data.frame(
    missing_resp_bytes=is.na(mydata[,"resp_bytes"]),
    proto=mydata[,"proto"])
mtab2=table(mtab2)
(apply(mtab2,2,function(x)x/sum(x)))

mtab3=data.frame(
    missing_local_orig=is.na(mydata[,"local_orig"]),
    proto=mydata[,"proto"])
mtab3=table(mtab3)
(apply(mtab3,2,function(x)x/sum(x)))
```
Thus we are missing the local_orig feature for every data point in the data set. We may then consider dropping this entire column as it serves no use to us and we cannot impute the data without prior knowledge of the data set and what it should look like. The duration, orig_bytes and resp_bytes all appear to be missing exactly the same data - on further analysis, we see that whenever one is missing, all three are missing. 

Some initial data cleansing will come from removing the X column and the ts column. The X column is produced by the sampling and since we have a random sample of the data, the ts provides no real information on the data.

```{r}
unique_uid <- mydata[!duplicated(mydata[,c('uid')]),]
unique_uid
```
Thus all our uid's are unique and therefore wont provide us with any extra information either since they will be uncorrelated with the rest of the data. This is the only column with this trait, and all other columns have values which occur more than once so we can drop the uid column too.

```{r}
drop_columns <- c("X","ts","local_orig","uid")
mydata <- mydata[, !names(mydata) %in% drop_columns]
```

```{r}
head(mydata)
```

So we have removed the columns that didn't provide us with any extra information. We will now extract the data we will use for DBSCAN to create clusters. The following code is pulled from Alex's workbook and allows us to pull out 7 of the features to use for DBSCAN and ensures all elements are numeric.

```{r}
miss.me <- vector(length = nrow(mydata))
miss.me <- rep(0, times = nrow(mydata))
for(i in 1:nrow(mydata)) {
	if(is.na(mydata$duration[i])) { miss.me[i] <- 1 }
	}
str(mydata)
mydata.good <- as.data.frame(cbind(id.orig_p = mydata$id.orig_p, id.resp_p = mydata$id.resp_p, 
orig_pkts = mydata$orig_pkts, orig_ip_bytes = mydata$orig_ip_bytes, 
resp_pkts = mydata$resp_pkts, resp_ip_bytes = mydata$resp_ip_bytes))
mydata.good<- cbind(mydata.good, miss.me)
head(mydata.good)
str(mydata.good) # Should be only ints and nums

for(i in 1:ncol(mydata.good)) { mydata.good[,i] <- as.numeric(mydata.good[,i]) }
str(mydata.good)		## All should be nums now
# sum(mydata.good$miss.me)/nrow(mydata.good) ## 82.7% missing

```

I dont want to drop any data that may be important so I'll also use the protocol, connection state and history features in my analysis.
```{r}
proto <- as.factor(c(mydata$proto))
proto <- unclass(proto)

conn_state <- as.factor(c(mydata$conn_state))
conn_state <- unclass(conn_state)

history <- as.factor(c(mydata$history))
history <- unclass(history)

mydata.good$proto <- proto
mydata.good$conn_state <- conn_state
mydata.good$history <- history

mydata.good
```

```{r}
	## We'll do 10-fold CV and then apply DBSCAN, training on 90%
dg <- mydata.good
ran <- sample(1:nrow(dg), 0.9 * nrow(dg))
nor <-function(x) { (x -min(x))/(max(x)-min(x))   }
dg_norm <- as.data.frame(lapply(dg, nor))
	# head(dg_norm)

dg_train <- dg_norm[ran,] 	## extract training set
dg_test <- dg_norm[-ran,]   	## extract testing set
dg_target_cat <- dg[ran, ncol(dg)]
dg_test_cat <- dg[-ran, ncol(dg)]
```

## SVD

Now we can look at running DBSCAN on our data. We first need to perform PCA to figure out how many principle components to use in DBSCAN.

```{r}
library("dbscan")
```

```{r}
dg_train.svd <- svd(dg_train)
```

```{r}
plot(dg_train.svd$d,xlab="Eigenvalue index",ylab="Eigenvalue",log="y")
plot(dg_train.svd$d,xlab="Eigenvalue index",ylab="Eigenvalue")
```
Plotting with a log y-axis and a normal y-axis give strikingly different results. The first eigenvector explains most of the variance and the 5 after that seeming explain almost the same amount of variance between them as the first one. I don't think using just the first eigenvalue would provide that much insight and therefore will use 6 of them (this is still reducing dimensionality by almost half).

```{r}
npcs = 6
```

We now plot the PCA to visualise the clusters formed here. We're not plotting according to any categorical data i.e. normal vs non-normal so we may not get that much information from this.

```{r}
i=1;j=2
plot(dg_train.svd$u[,i],
     dg_train.svd$u[,j],type="p",
     col="#33333311",pch=16,cex=1)
```

## Finding Parameters for DBSCAN

Eps specifies how close the points should be to each other to form a cluster. If the distance is less than eps, they are considered neighbours. We find this number by finding the 'knee' in the plot below. I have chosen to use 7 neighbours here.

```{r}
test=kNNdist(dg_train.svd$u[,1:npcs], k = 7,all=TRUE)
testmin=apply(test,1,min)
```

```{r}
plot(sort(testmin[testmin>1e-8]),log="y")
threshholds= c(0.01,0.001,0.0001,0.00001,0.000001)
abline(h=c(0.01,0.001,0.0001,0.00001,0.000001))
abline(h=0.0001, col="red")
```

So we choose h=0.0001 as our limit since this allows us to capture most of the information here. We also need to define our minimum number of points to form a cluster. The recommendation is to use minPts = 2*dim for large data sets to ensure we find significant clusters and avoid noise so this is what we shall choose.

##DBSCAN

Now we finally perform DBSCAN.

```{r}
minPts = c(20, 25, 30, 35, 40, 45, 50, 75, 100, 125, 150, 175, 200, 225, 250)
clustercounts = c()

for(val in minPts) {
  dbscanres = dbscan(dg_train.svd$u[,1:6],eps = 0.0001,minPts = val)
  clustercounts[val] <- (length(unique(dbscanres$cluster)))
}
```

```{r}
clustercounts
```

Thus we seem to hit some limiting value at 175 since a minPts of 200 has a higher amount of clusters than a minPts of 175. Since we want ou data to be interpretable we may look at reducing the amount of clusters and therefore the noise. Thus we will use an initial minPts of 175.

The output we get from this isn't very insightful though and leaves a lot to be desired. 

```{r}
dbscan175 = dbscan(dg_train.svd$u[,1:6],eps = 0.0001,minPts = 175)
dbscan50 = dbscan(dg_train.svd$u[,1:6],eps=0.0001,minPts = 50)
dbscan30 = dbscan(dg_train.svd$u[,1:6],eps=0.0001,minPts = 30)
```

```{r}
library(cluster)
```

```{r}
# trying to calculate the silhouette score of this clustering to see if its valid or not - currently reports Error: Vector memory exhausted (limit reached?) - I've tried looking into work arounds but cant get anything working so I'll leave this for now.
#ss <- silhouette(dbscanres$cluster, dist(dg_train.svd$u))
```

## Plotting resulting clusters

```{r}
for (k in 1:5){
    a = seq(k+1,6)
    for (l in a){
        if(k==l){next}
        plot(dg_train.svd$u[,k],
            dg_train.svd$u[,l],xlab="",
            ylab="",
            col=c("#66666666",rainbow(41))[dbscan175$cluster+1],pch=19,cex=0.5)
    par(new=TRUE)
    }
}
```

```{r}
for (k in 1:5){
    a = seq(k+1,6)
    for (l in a){
        if(k==l){next}
        plot(dg_train.svd$u[,k],
            dg_train.svd$u[,l],xlab="",
            ylab="",
            col=c("#66666666",rainbow(41))[dbscan50$cluster+1],pch=19,cex=0.5)
    par(new=TRUE)
    }
}
```

```{r}
for (k in 1:5){
    a = seq(k+1,6)
    for (l in a){
        if(k==l){next}
        plot(dg_train.svd$u[,k],
            dg_train.svd$u[,l],xlab="",
            ylab="",
            col=c("#66666666",rainbow(41))[dbscan30$cluster+1],pch=19,cex=0.5)
    par(new=TRUE)
    }
}
```

```{r}
for (k in 1:5){
    a = seq(k+1,6)
    for (l in a){
        if(k==l){next}
        plot(dg_train.svd$u[,k],
            dg_train.svd$u[,l],xlab="",
            ylab="",
            col=c("#66666666",rainbow(41))[dbscan175$cluster+1],pch=19,cex=0.5)
    }
}
```

```{r}
for (k in 1:5){
    a = seq(k+1,6)
    for (l in a){
        if(k==l){next}
        plot(dg_train.svd$u[,k],
            dg_train.svd$u[,l],xlab="",
            ylab="",
            col=c("#66666666",rainbow(41))[dbscan50$cluster+1],pch=19,cex=0.5)
    }
}
```

```{r}
for (k in 1:5){
    a = seq(k+1,6)
    for (l in a){
        if(k==l){next}
        plot(dg_train.svd$u[,k],
            dg_train.svd$u[,l],xlab="",
            ylab="",
            col=c("#66666666",rainbow(41))[dbscan30$cluster+1],pch=19,cex=0.5)
    }
}
```


```{r}
jpeg("Assessment2 DBSCAN Clustering.jpg")

for (k in 1:5){
    a = seq(k+1,6)
    for (l in a){
        if(k==l){next}
        plot(dg_train.svd$u[,k],
            dg_train.svd$u[,l],xlab="",
            ylab="",
            col=c("#66666666",rainbow(41))[dbscan30$cluster+1],pch=19,cex=0.5)
    par(new=TRUE)
    }
}
dev.off()
```

```{r}
jpeg("Assessment2 DBSCAN Clustering Eigenvector split.jpg")
op <- par(mfrow=c(5,3))
for (k in 1:5){
  a = seq(k+1,6)
  for (l in a){
      if(k==l){next}
      plot(dg_train.svd$u[,k],
            dg_train.svd$u[,l],xlab=k,
            ylab=l,
            col=c("#66666666",rainbow(41))[dbscan30$cluster+1],pch=19,cex=0.5)
    }
}
par(op)
dev.off()
```

```{r}
dg_train.clustered <- data.frame(dg_train)

dg_train.clustered$cluster <- dbscan175$cluster

dg_train.clustered
```

## M matrix

Now we look at the M matrix produced by Alex that is a sparse matrix showing connections between origin IP's and response IP's.

```{r}
library(Matrix)
library(irlba)
library(Rtsne)
```


```{r}
So1 <- tapply(mydata$id.orig_h, mydata$id.orig_h)
De1 <- tapply(mydata$id.resp_h, mydata$id.resp_h)
Est <- as.matrix(cbind(So1, De1))
M<- sparseMatrix(i=Est[,1], j=Est[,2])
```

```{r}
M.svd = svd(M)
```

```{r}
plot(M.svd$d,xlab="Eigenvalue index",ylab="Eigenvalue",log="y")
plot(M.svd$d,xlab="Eigenvalue index",ylab="Eigenvalue")
```

From the log axis it looks like we need the first ~100 eigenvalues but using the normal plot, it looks like we an get away with using ~30.

```{r}
npcsM = 30
```

```{r}
testM=kNNdist(M.svd$u[,1:npcsM], k = 7,all=TRUE)
testminM=apply(testM,1,min)
```

```{r}
plot(sort(testminM[testminM>1e-15]),log="y")
threshholds= c(0.1,0.01,0.001,0.0001,0.00001,0.000001)
abline(h=c(0.01,0.001,0.0001,0.00001,0.000001))
abline(h=0.5, col="red")
```

It looks like we want eps = 0.5 although we dont seem to get a knee in the data so it's hard to pinpoint this.

```{r}
MminPts = c(200)
clustercountsM = c()

for(val in MminPts) {
  dbscanresM = dbscan(dg_train.svd$u[,1:6],eps = 0.5,minPts = val)
  clustercountsM[val] <- (length(unique(dbscanresM$cluster)))
}
```

```{r}
clustercountsM
```

```{r}
dbscan30 = dbscan(N.svd$u[,1:npcsM],eps=0.5,minPts = 30)
```


References:

1. [Data from SecRepo](https://www.secrepo.com)

2. [Converting categorical variables](https://stackoverflow.com/questions/47922184/convert-categorical-variables-to-numeric-in-r/47923178)

3. [Adding columns to data frames](https://discuss.analyticsvidhya.com/t/how-to-add-a-column-to-a-data-frame-in-r/3278)

4. [Finding Unique Values](https://stackoverflow.com/questions/41906878/r-number-of-unique-values-in-a-column-of-data-frame)

5. [DBSCAN on flowers](https://www.geeksforgeeks.org/dbscan-clustering-in-r-programming/)

6. [Saving Plots](http://www.sthda.com/english/wiki/creating-and-saving-graphs-r-base-graphs)

7. [DBSCAN Parameter Estimation](https://en.wikipedia.org/wiki/DBSCAN#Parameter_estimation)

8. [Finding the knee in kNNDist](https://www.rdocumentation.org/packages/dbscan/versions/1.1-5/topics/kNNdist)

9. [Silhouette Score introduction](https://medium.com/codesmart/r-series-k-means-clustering-silhouette-794774b46586)

10. [Error with silhouette score](https://stackoverflow.com/questions/51248293/error-vector-memory-exhausted-limit-reached-r-3-5-0-macos)

11. [Silhouette Function](https://www.rdocumentation.org/packages/cluster/versions/2.1.0/topics/silhouette)